```mermaid
graph LR
A["Single-dimension<br/>factuality tests"] --> B["Incomplete picture<br/>of model accuracy"]
B --> C["FACTS Benchmark Suite<br/>4 benchmarks, 3,513 public examples<br/>Best model < 68%"]
C --> D["Systematic measure<br/>of LLM factuality"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
FACTS Benchmark Suite
A comprehensive benchmark suite from Google DeepMind for systematically evaluating the factuality of LLMs across grounding, parametric knowledge, search, and multimodal tasks
Keywords: FACTS Benchmark Suite, FACTS Grounding, factuality, hallucination, LLM evaluation, Google DeepMind, Google Research, Kaggle, grounding accuracy, parametric knowledge, search benchmark, multimodal factuality

Introduction
LLMs are increasingly becoming a primary source for information delivery — from answering questions to summarizing documents to analyzing images. But their grip on factual accuracy remains imperfect. They “hallucinate” false information, particularly when given complex inputs, eroding trust and limiting real-world applications.
Most benchmarks test knowledge or reasoning in isolation. But factuality failures happen in many different ways: a model might hallucinate when answering from memory, fail to ground its response in a provided document, retrieve the wrong information from the web, or misinterpret an image. Testing only one dimension gives an incomplete picture.
The FACTS Benchmark Suite addresses this by systematically evaluating LLM factuality across four distinct dimensions: grounding, parametric knowledge, search, and multimodal understanding. No model scores above 68% on the overall suite, revealing substantial room for improvement.
What Is the FACTS Benchmark Suite?
The FACTS Benchmark Suite is a collection of four complementary benchmarks designed to evaluate the factual accuracy of LLMs across different use cases. Each benchmark tests a distinct factuality capability, and the FACTS Score is the average accuracy across all four.
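As a quick sketch of the scoring arithmetic, the FACTS Score is the unweighted mean of the four per-benchmark accuracies. The example numbers below are the top row of the leaderboard later in this article; the helper function itself is just illustrative:

```python
# FACTS Score = unweighted mean of the four benchmark accuracies.
# Example values are the Gemini 3.1 Pro Preview leaderboard row.

def facts_score(grounding: float, parametric: float,
                search: float, multimodal: float) -> float:
    """Average accuracy across the four FACTS benchmarks."""
    return (grounding + parametric + search + multimodal) / 4

score = facts_score(grounding=65.0, parametric=78.9,
                    search=85.6, multimodal=41.3)
print(round(score, 1))  # 67.7
```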
```mermaid
graph TD
FACTS["FACTS Benchmark Suite<br/>3,513 public examples"] --> G["Grounding<br/>1,719 examples"]
FACTS --> P["Parametric<br/>2,104 examples"]
FACTS --> S["Search<br/>1,884 examples"]
FACTS --> M["Multimodal<br/>1,522 examples"]
G --> G1["Ground responses<br/>in provided documents<br/>Up to 32K tokens"]
P --> P1["Answer factoid questions<br/>from internal knowledge<br/>No external tools"]
S --> S1["Use web search<br/>to retrieve and synthesize<br/>Multi-hop queries"]
M --> M1["Answer questions<br/>about input images<br/>Visual + world knowledge"]
style FACTS fill:#e74c3c,color:#fff,stroke:#333
style G fill:#3498db,color:#fff,stroke:#333
style P fill:#27ae60,color:#fff,stroke:#333
style S fill:#f39c12,color:#fff,stroke:#333
style M fill:#8e44ad,color:#fff,stroke:#333
```
The Four Benchmarks
1. FACTS Grounding (v2)
Tests whether LLMs can generate factually accurate responses grounded in provided long-form documents (up to 32K tokens). Each example includes a system instruction, user request, and context document requiring a long-form response. Responses must be both comprehensive (addressing the user’s request) and fully grounded (no hallucinated claims).
- 1,719 examples (860 public + 859 private)
- Domains: finance, technology, retail, medicine, law
- Tasks: summarization, Q&A generation, rewriting
2. FACTS Parametric
Tests a model’s ability to accurately recall facts from its internal (parametric) knowledge when answering factoid questions — without the aid of external tools such as web search. Questions are trivia-style, driven by user interest, and answerable via Wikipedia.
- 2,104 examples (1,052 public + 1,052 private)
- Diverse domains and answer types
- Example: “Who played harmonica on ‘The Rockford Files’ theme song?”
3. FACTS Search
Tests a model’s ability to use web search as a tool to retrieve information and synthesize it correctly. Designed to be challenging even with web access, often requiring multi-hop retrieval — finding multiple facts sequentially to answer a single query. The same search tool is provided to all models to isolate model capability from tool quality.
- 1,884 examples (890 public + 994 private)
- Multi-hop queries requiring sequential fact retrieval
- Example: “What is the sum of the birth years of the British boxer who defeated Vazik Kazarian at the 1960 Summer Olympics, the Moroccan boxer who also competed… and the Danish boxer who competed in both the 1960 and 1964 Summer Olympics?”
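Conceptually, a multi-hop query like the one above decomposes into a chain of single-fact retrievals whose results are then combined in a final synthesis step. A toy sketch of that pattern — the mini knowledge base, the placeholder facts, and the `search` stub are all invented for illustration and are not part of the benchmark:

```python
# Toy illustration of multi-hop retrieval: each sub-question is a
# separate "search" hop, and the final answer combines the results.
# The knowledge base and birth years below are invented placeholders.
KB = {
    "birth year of boxer A": 1938,
    "birth year of boxer B": 1940,
    "birth year of boxer C": 1936,
}

def search(query: str) -> int:
    """Stand-in for one web-search hop."""
    return KB[query]

# Three sequential hops, then one synthesis step (the sum):
hops = ["birth year of boxer A",
        "birth year of boxer B",
        "birth year of boxer C"]
answer = sum(search(q) for q in hops)
print(answer)  # 5814
```

A model fails the query if any single hop retrieves the wrong fact, which is why multi-hop questions stay hard even with web access.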
4. FACTS Multimodal
Tests a model’s ability to answer questions about input images in a factually correct manner. Requires integrating visual grounding (accurately interpreting visual input) with internal world knowledge.
- 1,522 examples (711 public + 811 private)
- Diverse image types and question categories
- Example: An image of a moth with the prompt “What genus does this animal belong to?”
Key Characteristics
| Feature | Details |
|---|---|
| Total examples | 7,229 across 4 benchmarks (3,513 public + 3,716 private) |
| Benchmarks | Grounding, Parametric, Search, Multimodal |
| Document length | Up to 32K tokens (Grounding) |
| Evaluation | Ensemble of 3 frontier LLM judges |
| Anti-gaming | Quality filtering disqualifies evasive responses |
| Anti-contamination | Private held-out sets for each benchmark |
| FACTS Score | Average accuracy across all 4 benchmarks |
| Hosted by | Kaggle (independent reproduction) |
Who Built It?
The FACTS Benchmark Suite was developed by Google DeepMind and Google Research, in partnership with Kaggle for hosting and independent result reproduction.
Lead Contributors
The FACTS team includes:
- Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Dipanjan Das — Core FACTS team
With support from senior leadership including Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias.
Evolution
| Date | Milestone |
|---|---|
| December 2024 | FACTS Grounding v1 launched with leaderboard on Kaggle |
| January 2025 | Technical report published (arXiv:2501.03200) |
| December 2025 | FACTS Benchmark Suite launched (4 benchmarks), Grounding updated to v2 |
| Ongoing | Leaderboard actively maintained and updated by Kaggle |
Key Resources
| Resource | Link |
|---|---|
| Google DeepMind Blog | deepmind.google/blog/facts-benchmark-suite… |
| FACTS Grounding Paper (v1) | arxiv.org/abs/2501.03200 |
| FACTS Grounding Paper (v2) | arxiv.org/abs/2512.10791 |
| FACTS Benchmark Suite Paper | PDF (Google DeepMind) |
What Skills Does It Test?
The FACTS Benchmark Suite tests the complete factuality pipeline of LLMs — from internal knowledge recall to document grounding to web-based retrieval to visual understanding. This multi-dimensional approach reveals that models can excel in one dimension while failing in another.
```mermaid
graph TD
FACTS["FACTS Suite<br/>Factuality Skills"] --> IK["Internal Knowledge<br/>(Parametric)"]
FACTS --> DG["Document Grounding<br/>(Grounding)"]
FACTS --> WR["Web Retrieval<br/>(Search)"]
FACTS --> VU["Visual Understanding<br/>(Multimodal)"]
FACTS --> QF["Quality & Completeness<br/>(All benchmarks)"]
FACTS --> AH["Anti-Hallucination<br/>(All benchmarks)"]
style FACTS fill:#e74c3c,color:#fff,stroke:#333
style IK fill:#3498db,color:#fff,stroke:#333
style DG fill:#27ae60,color:#fff,stroke:#333
style WR fill:#f39c12,color:#fff,stroke:#333
style VU fill:#8e44ad,color:#fff,stroke:#333
style QF fill:#e67e22,color:#fff,stroke:#333
style AH fill:#6cc3d5,color:#fff,stroke:#333
```
| Capability | Benchmark | What It Tests |
|---|---|---|
| Internal knowledge | Parametric | Accurate recall of factual information from training data |
| Document grounding | Grounding | Generating responses fully supported by provided context |
| Information retrieval | Search | Using web search to find and synthesize facts correctly |
| Visual reasoning | Multimodal | Answering factual questions about images |
| Response quality | All | Providing comprehensive, useful responses (not evasive) |
| Anti-hallucination | All | Avoiding fabricated claims not supported by evidence |
Evaluation Methodology
FACTS uses an ensemble of 3 frontier LLM judges (originally Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) to evaluate responses. This multi-judge approach mitigates evaluation bias. Each response is evaluated in two phases:
1. Eligibility — Is the response a comprehensive answer to the user’s request? A response is disqualified only if all 3 judges agree it is ineligible.
2. Factuality — Is the response fully grounded / factually correct? The final score is the average of all 3 judges’ scores.
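The two-phase scheme can be sketched as follows. The vote and score inputs here are illustrative stand-ins, not DeepMind’s actual judge prompts or implementation:

```python
from statistics import mean

def score_response(eligibility_votes: list[bool],
                   factuality_scores: list[float]) -> float:
    """Illustrative two-phase FACTS scoring with 3 LLM judges.

    Phase 1 (eligibility): a response is disqualified only if
    *all* judges deem it ineligible (e.g. evasive or incomplete).
    Phase 2 (factuality): otherwise the score is the average of
    the judges' verdicts (1.0 = fully grounded / correct).
    """
    if not any(eligibility_votes):      # unanimous "ineligible"
        return 0.0
    return mean(factuality_scores)

# An evasive non-answer flagged by all three judges scores 0,
# even though it makes no false claims:
print(score_response([False, False, False], [1.0, 1.0, 1.0]))  # 0.0
# A comprehensive response keeps the averaged judge score:
print(score_response([True, True, False], [1.0, 0.5, 1.0]))
```

The unanimity requirement in phase 1 is what makes the anti-gaming filter hard to trip accidentally: a single dissenting judge keeps the response in play.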
Current Leaderboard
FACTS Benchmark Suite (Overall)
The overall FACTS Score is the average accuracy across all four benchmarks. Results are independently reproduced by Kaggle.
Source: FACTS Benchmark Suite Leaderboard on Kaggle (consulted March 28, 2026). Last updated March 25, 2026.
| Rank | Model | FACTS Score | Grounding | Multimodal | Search | Parametric |
|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 67.7% | 65.0% | 41.3% | 85.6% | 78.9% |
| 2 | Gemini 2.5 Pro | 62.1% | 74.3% | 46.9% | 63.9% | 63.2% |
| 3 | GPT-5 | 61.8% | 69.6% | 44.1% | 77.7% | 55.8% |
| 4 | Gemini 3 Flash Preview | 60.4% | 59.0% | 41.3% | 81.0% | — |
| 5 | Gemini 3.1 Flash-Lite Preview | 57.6% | 66.5% | 39.4% | 66.8% | — |
| 6 | GPT-5.2 | 54.4% | 76.2% | 39.7% | 72.2% | 29.7% |
| 7 | Grok 4 | 53.6% | 54.7% | 25.7% | 75.3% | 58.6% |
| 8 | o3 | 52.0% | 36.2% | 39.9% | 74.8% | 57.1% |
| 9 | Claude Opus 4.5 | 51.3% | 62.1% | 39.2% | 73.2% | 30.6% |
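As a sanity check, the published FACTS Scores in the table are consistent with the mean of the four per-benchmark columns. A small sketch, restricted to the rows that report all four scores (the Flash rows with a missing Parametric score are skipped):

```python
# Rows from the table above that report all four benchmark scores:
# model -> (FACTS Score, Grounding, Multimodal, Search, Parametric)
rows = {
    "Gemini 3.1 Pro Preview": (67.7, 65.0, 41.3, 85.6, 78.9),
    "Gemini 2.5 Pro":         (62.1, 74.3, 46.9, 63.9, 63.2),
    "GPT-5":                  (61.8, 69.6, 44.1, 77.7, 55.8),
    "GPT-5.2":                (54.4, 76.2, 39.7, 72.2, 29.7),
    "Grok 4":                 (53.6, 54.7, 25.7, 75.3, 58.6),
    "o3":                     (52.0, 36.2, 39.9, 74.8, 57.1),
    "Claude Opus 4.5":        (51.3, 62.1, 39.2, 73.2, 30.6),
}
for model, (published, *parts) in rows.items():
    recomputed = sum(parts) / 4
    # Published scores are rounded to one decimal, so allow 0.05 slack.
    assert abs(recomputed - published) <= 0.051, model
    print(f"{model}: {recomputed:.2f} ~ {published}")
```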
FACTS Grounding (Standalone)
The Grounding benchmark has the largest number of evaluated models (66+). Top performers:
Source: FACTS Grounding Leaderboard on Kaggle (consulted March 28, 2026).
| Rank | Model | Score | Public | Private |
|---|---|---|---|---|
| 1 | GPT-5.2 | 76.2% ± 2.0 | 77.3% | 75.1% |
| 2 | Gemini 2.5 Pro | 74.3% ± 2.1 | 74.3% | 74.3% |
| 3 | Llama 3 – Grounded LM | 71.8% ± 2.1 | 72.0% | 71.5% |
| 4 | Gemini 2.5 Flash | 70.0% ± 2.2 | 70.5% | 69.5% |
| 5 | GPT-5 | 69.6% ± 2.2 | 69.3% | 70.0% |
| 6 | Gemini 3.1 Flash-Lite | 66.5% ± 2.2 | 67.4% | 65.7% |
| 7 | Gemini 3.1 Pro Preview | 65.0% ± 2.3 | 65.9% | 65.5% |
| 8 | Claude Opus 4.5 | 62.1% ± 2.3 | 64.4% | 59.8% |
| 9 | Claude Sonnet 4.5 (thinking) | 61.8% ± 2.3 | 64.5% | 59.1% |
Key Observations
```mermaid
graph LR
A["Multimodal is hardest<br/>Best: 46.9%<br/>(Gemini 2.5 Pro)"] --> C["Factuality remains<br/>an unsolved problem"]
B["Search is strongest<br/>Best: 85.6%<br/>(Gemini 3.1 Pro)"] --> C
D["No model > 68%<br/>overall FACTS Score"] --> C
C --> E["Multi-dimensional<br/>evaluation is essential"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style D fill:#f39c12,color:#fff,stroke:#333
style E fill:#3498db,color:#fff,stroke:#333
```
- No model exceeds 68% overall — Even the best model (Gemini 3.1 Pro Preview at 67.7%) leaves substantial room for improvement
- Multimodal is the hardest dimension — The best multimodal score is just 46.9%, far below other dimensions
- Search is the strongest dimension — Models score up to 85.6% when given web search tools
- Grounding and Parametric vary widely — Some models excel at grounding (GPT-5.2 at 76.2%) but struggle with parametric knowledge (29.7%)
- Specialization vs generalization — Models that top one benchmark often trail in others, highlighting the need for multi-dimensional evaluation
Where to Explore the Benchmark
Leaderboards on Kaggle
| Resource | Description | Link |
|---|---|---|
| FACTS Suite Leaderboard | Overall ranking across all 4 benchmarks | kaggle.com/benchmarks/google/facts |
| FACTS Grounding | Standalone grounding leaderboard (66+ models) | kaggle.com/benchmarks/google/facts-grounding |
| FACTS Parametric | Standalone parametric knowledge leaderboard | kaggle.com/benchmarks/google/facts-parametric |
| FACTS Search | Standalone search-augmented leaderboard | kaggle.com/benchmarks/google/facts-search |
| FACTS Multimodal | Standalone multimodal factuality leaderboard | kaggle.com/benchmarks/google/facts-multimodal |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| FACTS Grounding Public Dataset | 860 public examples for self-evaluation | kaggle.com/datasets/deepmind/facts-grounding-examples |
| Starter Notebook | Kaggle notebook for running FACTS Grounding v2 | kaggle.com/code/prathameshbang/facts-grounding-v2-benchmark-starter |
| FACTS Grounding Paper (v1) | Technical report with methodology | arxiv.org/abs/2501.03200 |
| FACTS Grounding Paper (v2) | Updated judges and methodology | arxiv.org/abs/2512.10791 |
| FACTS Suite Paper | Full technical report for all 4 benchmarks | |
| DeepMind Blog (Grounding) | Blog post introducing FACTS Grounding | deepmind.google/blog/facts-grounding… |
| DeepMind Blog (Suite) | Blog post introducing the full FACTS Suite | deepmind.google/blog/facts-benchmark-suite… |
Submit Your Model
To request evaluation of a new model on the full FACTS leaderboard (including private held-out sets), fill out the submission form. Official results are run by the Kaggle team to ensure integrity.
Why FACTS Matters
```mermaid
graph LR
A["Single-dimension<br/>factuality tests"] --> B["Models appear<br/>more accurate<br/>than they are"]
B --> C["FACTS Suite<br/>exposes blind spots"]
C --> D["Trustworthy<br/>LLM deployment"]
A2["Hallucination<br/>is multi-faceted"] --> B2["Grounding ≠ Knowledge<br/>≠ Search ≠ Vision"]
B2 --> C
C --> D2["Targeted research<br/>on each dimension"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
```
- Multi-dimensional evaluation — Tests grounding, parametric knowledge, search, and multimodal factuality in one suite
- Reveals specialization gaps — Models excelling at grounding may fail at parametric knowledge, and vice versa
- Anti-gaming design — Quality filtering prevents evasive short responses; private held-out sets guard against contamination
- Multi-judge evaluation — Ensemble of 3 frontier LLM judges mitigates scoring bias
- Independently hosted — Kaggle independently reproduces all results, ensuring integrity
- Actively maintained — Leaderboard continuously updated with new models and benchmark improvements
Conclusion
The FACTS Benchmark Suite provides the most comprehensive evaluation of LLM factuality available today:
- 4 benchmarks covering grounding, parametric knowledge, search, and multimodal factuality
- 7,229 examples (3,513 public) across diverse domains (finance, technology, medicine, law, retail)
- Built by Google DeepMind and Google Research, hosted and independently verified by Kaggle
- The best model scores 67.7% overall — substantial headroom for improvement remains
- Multimodal factuality is the weakest dimension across all models (best: 46.9%)
- Grounding and parametric knowledge show wide variance — models that ground well can fail at knowledge recall, and vice versa
As LLMs become primary information sources, the FACTS Benchmark Suite ensures we can measure not just whether models know facts, but how reliably they use them — whether from internal knowledge, provided documents, web search, or visual inputs.
“We hope this work encourages deeper research into LLM factuality, leading to better and more accurate models and products for the people that rely on them.” — Google DeepMind FACTS Team
References
- Jacovi, A., Wang, A., Alberti, C., Tao, C., Lipovetz, J., Olszewska, K., Haas, L., Liu, M., Keating, N., Das, D. “The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input.” arXiv preprint arXiv:2501.03200 (2025). arxiv.org/abs/2501.03200
- Google DeepMind. “FACTS Grounding v2 Technical Report.” arXiv preprint arXiv:2512.10791 (2025). arxiv.org/abs/2512.10791
- Google DeepMind. “FACTS Benchmark Suite Paper.” PDF
- Google DeepMind. “FACTS Grounding: A new benchmark for evaluating the factuality of large language models.” deepmind.google/blog/facts-grounding… (December 2024)
- Google DeepMind. “FACTS Benchmark Suite: Systematically evaluating the factuality of large language models.” deepmind.google/blog/facts-benchmark-suite… (December 2025)
- Google DeepMind & Kaggle. “FACTS Benchmark Suite Leaderboard.” kaggle.com/benchmarks/google/facts (consulted March 28, 2026)
- Google DeepMind & Kaggle. “FACTS Grounding Leaderboard.” kaggle.com/benchmarks/google/facts-grounding (consulted March 28, 2026)
Read More
- Compare with the hardest academic benchmark — see Humanity’s Last Exam (HLE)
- Compare with the AGI fluid intelligence benchmark — see ARC-AGI-2
- Compare with the chart understanding benchmark — see CharXiv Reasoning
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- Scale your evaluation infrastructure — see Scaling LLM Serving for Enterprise Production
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- FACTS Suite Leaderboard on Kaggle
- FACTS Grounding Leaderboard on Kaggle
- FACTS Grounding Public Dataset